© USO—iStock/Getty Images
© USO—iStock/Getty Images

1. Introduction

1.1 Background and Motivation

At first, we wanted to analyse the impacts of the recent global pandemic on individuals within the context of a variety of mental health conditions, namely anxiety and depression, in a pre- and post-COVID framework. However, we decided to drop the idea due to a lack of data available regarding the topic, which wouldn’t have allowed us to conduct a thorough analysis. As we were searching for new topic ideas, we found a dataset that concerns shark attacks around the world. A group member has then told us about an incident that occurred during his last holidays in Egypt, involving a shark attack at a paradisiac beach. We discussed thereafter the presence of these predators in many different regions in the world including Florida, Hawaii, and Australia just to name a few. For this reason, we decided to dive deeper into the topic of shark attacks and explore potential influencing factors. After a few days of research and information gathering, we found that a wide range of factors are capable of influencing such incidents, both directly and indirectly. We all enjoy going on holidays and enjoy beaches while drinking a cocktail, and when we feel that temperature increases, we go in the water to refresh, but depending on the location we are, this could be unsafe.

Another motivation behind this project is our interest in the preservation of marine fauna. Very often human beings tend to primarily blame the animals for their fierce behavior (just look at the example of the attack that occurred in Hurghada during summer of 2023: after the death of the victim, the shark was killed!) but we often forget that the water it is their natural habitat, and that we are the guests. This point motivates us even more to study the phenomenon of attacks and to disclose how human activity causes shark attacks.

1.2 Project Objective

The project’s focus is to provide an understanding of the complex dynamics of shark attacks globally, by conducting an in-depth analysis on the different factors that influence the phenomenon. In addressing the topic at hand, we explore both direct and indirect influencing factors.

At the end of our work we want to learn how it is possible to reduce accidents caused by sharks. On the one hand, this is crucial to guarantee the safety of people who take pleasure in coastal waters and engage in water-related activities. On the other hand, the conservation of aquatic animals might benefit from this. Making efforts to preserve these species is essential for sustaining the health of marine ecosystems in a time when the extinction rate of many animal species is rising. Among the factors that might have an impact on the trend of shark attacks we have climatic variations. Ocean temperatures and currents are capable of affecting migration patterns and habitat preferences, which further complicates the dynamics of shark attacks. By providing a step-by-step analysis on climate-related aspects, we aim to provide empirical evidence on the impact of environmental changes on shark movements and interactions with humans.

This project’s goal is to contribute to the understanding of shark-related incidents and thereby to provide possible strategies such as policymaking, which would help mitigate such occurrences, ultimately safeguarding human life and protecting sharks. In essence, we aim to advance research on sustainable coexistence with the marine environment through actionable insights and intend to contribute to the broader efforts of marine conservation.

1.4 Research Questions

  1. What are the primary factors influencing the occurrence of shark attacks?
  2. Is it true that shark attacks have been increasing with climate change?
  3. What are the main factors leading to the fatality of a shark attack?

2. Data Sets

In the second section of our work, we will present the datasets that are necessary to answer our research questions. These datasets serve as the foundation for our investigation and provide the data required to draw insightful and important findings. The purpose of this part is to provide a clear overview of the data landscape and make clear to the reader what are the information to be found in each web-based data, and how do they interrelate.

2.1 Shark Attacks Data Set

Our first dataset contains the information in over 6000 shark attacks that have occurred worldwide over the course of the past centuries. The dataset was obtained from the Shark Research Institute’s (SRI) website, which carries out and funds rigorous, peer-reviewed shark field research and instructs people using scientific data.

Even though the dataset is owned by the SRI, we have downloaded it through the following plateform: https://www.kaggle.com/datasets/teajay/global-shark-attacks.


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Case Number", "Date", "Year", "Type", "Country", "Area", "Location", "Activity", "Name", "Sex", "Age", "Injury", "Fatal", "Time", "Species", "Investigator.or.Source", "pdf", "href.formula", "Case Number 2")
Description <- c("Case Number of accident", "Date of the accident", "Year of accident", "Whether the victim provoked the shark or not", "Country of accident", "Region", "Location", "Activity the victim was doing before accident","Name of the victim", "Sex of the victim", "Age of the victim","Injury provoked by the accident", "Whether the accident led to death or not (Binary variable)", "Time of accident", "Shark Species", "Investigator of the accident", "Pdf information of the accident", "Pdf link of the accident", "Case number (same aas the first one)")


Table1 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table1)
Variable Description
Case Number Case Number of accident
Date Date of the accident
Year Year of accident
Type Whether the victim provoked the shark or not
Country Country of accident
Area Region
Location Location
Activity Activity the victim was doing before accident
Name Name of the victim
Sex Sex of the victim
Age Age of the victim
Injury Injury provoked by the accident
Fatal Whether the accident led to death or not (Binary variable)
Time Time of accident
Species Shark Species
Investigator.or.Source Investigator of the accident
pdf Pdf information of the accident
href.formula Pdf link of the accident
Case Number 2 Case number (same aas the first one)

2.2 Temperatures Data Set

Our second dataset contains the average temperature change for each country (annually, monthly and seasonally) from 1961 to 2022. It contains 9,657 observations.

The dataset was taken from the website of the Food and Agriculture Organization of United Nations: https://www.fao.org/faostat/en/#data/ET


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Area Code", "Area", "Months ", "Month code ", "Element Code ", "Element", "Unit", "Years")
Description <- c("Code for each country", "Country for which the data has been collected", "Month during which the data on temperature has been collected", "Code of each month ", "Binary variable (7271: Temperature change; 6078: Standard Deviation)", "What is the data measuring (either temperature change or standard deviation of data collection)", "Unit of measure of temperature ", "Year of observations" )


Table2 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table2)
Variable Description
Area Code Code for each country
Area Country for which the data has been collected
Months Month during which the data on temperature has been collected
Month code Code of each month
Element Code Binary variable (7271: Temperature change; 6078: Standard Deviation)
Element What is the data measuring (either temperature change or standard deviation of data collection)
Unit Unit of measure of temperature
Years Year of observations

2.3 Sea Level Data Set

Our third dataset contains information on the global increase of sea level from 1993 to 2021. The following acronims are important to be noted: GMLS:Global mean sea level GIA: Global Isostatic Adjustment, which is the ongoing movement of land once burdened by ice-age glaciers (National Ocean Service, n.d.)

The dataset was taken from the website of the NASA: https://climate.nasa.gov/vital-signs/sea-level/


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Year", "TotalWeightedObservations ", "GMSL_noGIA ", "StdDevGMSL_noGIA ", "SmoothedGSML_noGIA ", "GMSL_GIA", "StdDevGMSL_GIA", "SmoothedGSML_GIA", "SmoothedGSML_GIA_sigremoved")
Description <- c("Year of the observation", "Total Weighted Observations of Sea Level", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed ")


Table3 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table3)
Variable Description
Year Year of the observation
TotalWeightedObservations Total Weighted Observations of Sea Level
GMSL_noGIA Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)
StdDevGMSL_noGIA Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)
SmoothedGSML_noGIA Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)
GMSL_GIA Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)
StdDevGMSL_GIA Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)
SmoothedGSML_GIA Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)
SmoothedGSML_GIA_sigremoved Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed

2.4 Greenhouse Gas Emissions Data Set

Our fourth dataset relates to GHG emissions, one of the main concerns linked to global warming. The dataset contains 31.315 observations on annual GHG emissions per country and per continent, but we will focus on the first case only.

The dataset was taken from the website Our World in Data: https://ourworldindata.org/annual-co2-emissions


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Entity", "Code", "Year", "Annual COâ‚‚ emissions")
Description <- c("Country examinated", "ISO Code", "Year of the observation", "GHG emissions per year")


Table4 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table4)
Variable Description
Entity Country examinated
Code ISO Code
Year Year of the observation
Annual COâ‚‚ emissions GHG emissions per year

3. Data Wrangling

The first step of our work is data cleaning. While collecting data for a study, it might happen to have some variables with missing information, worthless data or spelling mistakes In order to run a regression, it is fundamental that each column within a data set has no missing values and that information is written following the same format. This procedure takes the name of data cleaning, and it will cover the whole subsection.

Please be aware that at the beginning of each subsection we only provide general instructions of our cleaning. Through the code, you will get additional in-depth explanations of each step we took.

3.1 Attacks Cleaning

The first dataset we will examine is the one pertaining to shark attacks. It was quite full of misspellings and missing values, which required a lot of time and a very dense code to be able to clean it.

For our work, we have decided to only study the phenomenon of shark attacks from 1970 forward. In fact, it is possible to observe in the raw dataset that, before to this date, observations had more erroneous information. This leads one to believe that, more likely than not, the information we needed is more accurate if we do not travel too far back in time.

While cleaning the column of Ages, we realized that NA were too much and we could not delete them all, because that would have meant losing 29,5% of our data. Therefore, we have decided to work in the following way: for each country that presents LESS NA than the total observation, we compute an histogram (Age against frequency) of non-missing data. If the distribution is normal, we substitute the NA with the mean. If it is left skewed, we substitute the NAs with the first quantile, if it is right skewed, we substitute it with the third quantile.

Here is an example with USA and Bahamas. For USA, the distribution of the non-missing victims’ ages is skewed on the left. Therefore we have decided to substitute all the missing values with the first quantile. For Bahamas, it is evident that data follow a normal distribution, therefore we replace NAs with the mean.

#EGYPT#

ages_in_egy <- attacks %>%
  filter(Country == "EGYPT") %>%
  select(Age)

ages_vector <- ages_in_egy$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_egy = mean(ages_without_na)

attacks$Age[attacks$Country == "EGYPT" & is.na(attacks$Age)] <- mean_egy


#ITALY#

# Extract all ages from individuals who come from Italy
ages_in_italy <- attacks %>%
  filter(Country == "ITALY") %>%
  select(Age)

# Convert the extracted ages to a vector
ages_vector <- ages_in_italy$Age


# Display or use the 'ages_vector' containing ages from individuals in Italy

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_age = mean(ages_without_na)

# Replace all the NA on Italy with the mean of people attacked in Italy.

attacks$Age[attacks$Country == "ITALY" & is.na(attacks$Age)] <- 37.375

#AUSTRALIA#

ages_in_aus <- attacks %>%
  filter(Country == "AUSTRALIA") %>%
  select(Age)

ages_vector <- ages_in_aus$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_aus = mean(ages_without_na)

attacks$Age[attacks$Country == "AUSTRALIA" & is.na(attacks$Age)] <- mean_aus



#SOUTH AFRICA#

ages_in_SA <- attacks %>%
  filter(Country == "SOUTH AFRICA") %>%
  select(Age)

ages_vector <- ages_in_SA$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_SA = mean(ages_without_na)

attacks$Age[attacks$Country == "SOUTH AFRICA" & is.na(attacks$Age)] <- mean_SA


#BRAZIL#

ages_in_BRA <- attacks %>%
  filter(Country == "BRAZIL") %>%
  select(Age)

ages_vector <- ages_in_BRA$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_BRA = mean(ages_without_na)

attacks$Age[attacks$Country == "BRAZIL" & is.na(attacks$Age)] <- mean_BRA


#FIJI#

ages <- attacks %>%
  filter(Country == "FIJI") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "FIJI" & is.na(attacks$Age)] <- mean

#FRENCH POLYNESIA#

ages <- attacks %>%
  filter(Country == "FRENCH POLYNESIA") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "FRENCH POLYNESIA" & is.na(attacks$Age)] <- mean

#HONG KONG#

ages <- attacks %>%
  filter(Country == "HONG KONG") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "HONG KONG" & is.na(attacks$Age)] <- mean

#JAPAN#

ages <- attacks %>%
  filter(Country == "JAPAN") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "JAPAN" & is.na(attacks$Age)] <- mean

#MEXICO#

ages <- attacks %>%
  filter(Country == "MEXICO") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "MEXICO" & is.na(attacks$Age)] <- mean


#MOZAMBIQUE#

ages <- attacks %>%
  filter(Country == "MOZAMBIQUE") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "MOZAMBIQUE" & is.na(attacks$Age)] <- mean


#NEW ZELAND#

ages <- attacks %>%
  filter(Country == "NEW ZEALAND") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "NEW ZEALAND" & is.na(attacks$Age)] <- mean

#PHILIPPINES#

ages <- attacks %>%
  filter(Country == "PHILIPPINES") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "PHILIPPINES" & is.na(attacks$Age)] <- mean

#REUNION#

ages <- attacks %>%
  filter(Country == "REUNION") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "REUNION" & is.na(attacks$Age)] <- mean

#NEW CALEDONIA#

ages <- attacks %>%
  filter(Country == "NEW CALEDONIA") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "NEW CALEDONIA" & is.na(attacks$Age)] <- mean


#We fixed those countries that had lots of NA. Now, for the remaining NA we decided not to do anything for a specific reason. Not only we have not cound the real information on the internet, but we also believe that doing the mean for those remaining situations was useless, because these are the cases when NA are either the same amount of total shark attacks (see Antigua with 1 shark attack and 1 NA), or NA are more than half of the total attacks (see Malaysia with 4 total attacks but 3 of them are NA).


count_na_by_country <- attacks %>%
  group_by(Country) %>%
  summarise(NA_count = sum(is.na(Age)))


attacks <- subset(attacks, !is.na(Age))

#___________________________________________________________________________________________________________________


#NOW WORK ON TIME

#sometimes hours are written in a different format compared to most of the others, which follow the format
#13h30. Since they are not a lot we just replaced them manually

attacks<- mutate(attacks, Time= ifelse(Time=="Possibly same incident as 2000.08.21", "", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h00  -15h00", "14h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h30 / 15h30", "15h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h30 / 10h00", "9h45", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h45-11h15", "11h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="Sometime between 06h00 & 08hoo", "7h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="18h15-18h30", "18h20", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h00 - 09h30", "9h15", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h00 -- 11h00", "10h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="11h115", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 05h00 and 08h00", "6h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "17h00 or 17h40", "17h20", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "13h345", "13h45", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "<a0> ", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "09h00 -10h00", "9h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "2 hours after Opperman", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "11h00 / 11h30", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 06h00 & 07h20", "6h40", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "30 minutes after 1992.07.08.a", "", Time))



#now, our objective is to classify hours in parts of the day. indeed, it is useless to keep hours as they are
#for a regression because we want times like 7h30 and 7h00 OR 15h00 and 16h00 to be read as the same thing.
# some of the rows already had the part of the day in it, but in order to work easily with all the column, we
#replaced "morning" with 8h00 etc. in this way, we're able to remove all the strings that are useless. finally, 
#void rows have been replaced by an NA and hours, which were there as CHAR have been replaced by numeric

attacks$Time <- str_replace_all(attacks$Time, "\\bmorning\\b", "8h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bafternoon\\b", "15h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bevening\\b", "20h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bnight\\b", "23h00")

attacks$Time <- gsub("[^[:digit:]]", "", attacks$Time)
attacks$Time = na_if(attacks$Time, "")
attacks$Time <- as.numeric(attacks$Time)

# since we removed all the letters, hours now are not written as 8h00 or 15h30 but as 800 and 15h00. this is
# great for us, so that we can easily create a function that classifies those numbers as parts of the day. the 
#function works this way: if a value is included between 0 and 500 (i.e. midnight and 5a.m.), the value is replaced
#by the word "night" etc.

timeoftheday <- function(time) {
  if (!is.na(time)) {
    if (time >= 0 && time < 500) {
      return("night")
    } else if (time >= 500 && time < 1200) {
      return("morning")
    } else if (time >= 1200 && time < 1730) {
      return("afternoon")
    } else if (time >= 1730 && time < 2400) {
      return("evening")
    } else {
      return("")  # Handle any other cases (if necessary)
    }
  } else {
    return(NA)  # Retain NA values
  }
}

attacks$Time <- sapply(attacks$Time, timeoftheday)
attacks$Time <- tolower(attacks$Time)
attacks$Time <- na_if(attacks$Time, "")


sum(is.na(attacks$Time)) #this is the only one that still presents 1415 NA. we dont want to delete
#them all coz we'd lose 44% of our data. 
table(attacks$Time)#table shows that most of them happen in the afternoon, while 2ns place is owned by
#morning. To confirm the higher frequency of attacks between 8am and 6pm is this artile (link JC found??)
#we explain this by the fact that, naturally, most of people swim during the day. therefore, what we do
#is replacing NA randomly by the proportion of afternoon, morning and evening.

sum(!is.na(attacks$Time))

#create function that substitutes the NA in time with one of the 4 parts of the day, based on their
#proportion presence in our dataset

non_na_time_proportions <- table(attacks$Time) / sum(!is.na(attacks$Time))


# Identify NA positions
position_of_na <- is.na(attacks$Time)

# Generate random values based on proportions
attacks$Time[position_of_na] <- sample(
  names(non_na_time_proportions),
  sum(position_of_na),
  replace = TRUE,
  prob = as.vector(non_na_time_proportions)
)


#_____________________________________________________________________________________________
#NOW WE WORK ON SEX

attacks$Sex <- gsub("lli", "", attacks$Sex)
attacks$Sex <- gsub("M ", "M", attacks$Sex)
attacks$Sex <- na_if(attacks$Sex, "")
attacks$Sex <- ifelse(is.na(attacks$Sex), "Unknown", attacks$Sex)

#__________________________________________________________________________________________________________________


#NOW WE WORK ON SHARK SPECIES


#Creation of a variable species where we delete all the numbers that concern the size of the shark
attacks$Species <- str_remove_all(attacks$Species, "[[:digit:]]")

#If there is an empty cell, we will put NA
species = na_if(attacks$Species, "")

#gsub to take off punctuation
species <- gsub("[[:punct:]]", "", species)

#Delete all the numbers followed by cm or m
species <- gsub("\\d+\\s*(cm|m)\\b", "", species)

#Delete word composed by 1 or 2 letters
species <- gsub("\\b\\w{1,2}\\b", "", species)

#Delete all the numbers
species <- gsub("\\d", "", species)

#Delete all unities as lb, kg or l
species <- gsub("\\b(kg|lb|b)\\b", "", species)

#If the word shark appears more than once in each cell, we only write shark once
species <- gsub("\\bshark\\b(.*\\bshark\\b)?", "shark", species, ignore.case = TRUE)


# Rewrite shark species names in one way, and do that for each species
species<- gsub("^.*White shark.*$", "White shark", species)
species<- gsub("^.*whitetip shark.*$", "White shark", species)
species<- gsub("^.*white shark.*$", "White shark", species)
species<- gsub("^.*Wobbegong shark.*$", "Wobbegong shark", species)
species<- gsub("^.*Wobbegong.*$", "Wobbegong shark", species)
species<- gsub("^.*Zambesi shark.*$", "Zambesi shark", species)
species<- gsub("^.*Zambezi shark.*$", "Zambesi shark", species)
species<- gsub("^.*whaler shark.*$", "Whale shark", species)
species<- gsub("^.*whale shark.*$", "Whale shark", species)
species<- gsub("^.*tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*Tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*thresher shark.*$", "Thresher shark", species)
species<- gsub("^.*spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark feet.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*spinner  blacktip shark.*$", "Spinner shark", species)
species<- gsub("^.*Tawny nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Mako shark.*$", "Mako shark", species)
species<- gsub("^.*mako shark.*$", "Mako shark", species)
species<- gsub("^.*Lemon shark.*$", "Lemon shark", species)
species<- gsub(".*lemon shark.*", "Lemon shark", species, ignore.case = TRUE)
species<-gsub("^\\s*shark\\s*$", "Unidentified shark", species)
species<- gsub(".*bull.*", "Bull shark", species, ignore.case = TRUE)
species<- gsub(".*blue.*", "Blue shark", species, ignore.case = TRUE)
species<- gsub(".*reef.*", "Reef shark", species, ignore.case = TRUE)
species<- gsub(".*sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("^.*Sand shark.*$", "Sand shark", species)
species<- gsub(".*Sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*sandshark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*Sandbar shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("juvenile\\s+\\w+", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark  blacktip shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("blacktip\\s+\\w+", "Blacktip shark",species, ignore.case = TRUE)
species <- gsub("\\black\\w*\\b", "Blacktip shark", species, ignore.case = TRUE)

#Replace all occurrences of the word "blacktip" in a column with "Blacktip Shark," even if the word is accompanied by other words before or after

species <- gsub("\\bblacktip\\b", "Blacktip Shark", species, ignore.case = TRUE)
species<- gsub(".*copper shark*", "Copper shark", species, ignore.case = TRUE)
species <- gsub("\\bcow\\b", "Cow", species)
species <- gsub("\\bsilky\\b", "Silky", species)
species <- gsub("\\bsilvertip\\b", "Silvertip", species)


#Condition: if "Hammerhead" followed by other words then replaced by Hammerhead shark
species <- ifelse(grepl("Hammerhead\\s+\\w+", species, ignore.case = TRUE), "Hammerhead shark", species)
species <- ifelse(grepl("Blacktip\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Raggedtooth\\s+\\w+", species, ignore.case = TRUE), "Raggedtooth shark", species)
species <- ifelse(grepl("Porbeagle\\s+\\w+", species, ignore.case = TRUE), "Porbeagle shark", species)


#if the word shark does not appear then NA
species <- ifelse(grepl("shark", species, ignore.case = TRUE), species, NA)
species <- ifelse(grepl("Juvenile\\s+\\w+", species, ignore.case = TRUE), "Juvenile shark", species)
species <- ifelse(grepl("grey\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("greycolored\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("gray\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("Broadnose\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("7gill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("sevengill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("black\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Sand\\s+\\w+", species, ignore.case = TRUE), "Sand shark", species)
species <- ifelse(grepl("Carpet\\s+\\w+", species, ignore.case = TRUE), "Carpet shark", species)
species <- ifelse(grepl("\\bbrown\\b", species, ignore.case = TRUE), "Brown shark", species)

species <- gsub("^frag\\w+", "Unidentified shark", species)
species <- gsub("^shark$", "Unidentified shark", species)

#Function to check if "small" is present in cell
contains_small <- function(text) {
  return(grepl("small", text))
}
# Replace cell with "unidentified shark" if "small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)

#Function to check if "Small" is present in cell
contains_small <- function(text) {
  return(grepl("Small", text))
}
#  Replace cell with "unidentified shark" if "Small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)

#Function to check if "sharks" is present in cell
contains_sharks <- function(text) {
  return(grepl("sharks", text))
}
# Replace cell with"unidentified shark" if "sharks" is present
species <- ifelse(sapply(species, contains_sharks), "Unidentified shark", species)

#Function to check if "cookie" is present in cell
contains_cookie <- function(text) {
  return(grepl("cookie", text))
}
# Replace cell with"Cookiecutter shark" if "cookie" is present
species <- ifelse(sapply(species, contains_cookie), "Cookiecutter shark", species)


#Function to check if "involvement" is present in cell
contains_involvement <- function(text) {
  return(grepl("involvement", text))
}
# Replace cell with"No shark" if "involvement" is present
species <- ifelse(sapply(species, contains_involvement), "No shark", species)

#Function to check if "invovlement" is present in cell
contains_invovlement <- function(text) {
  return(grepl("invovlement", text))
}
# Replace cell with"No shark" if "invovlement" is present
species <- ifelse(sapply(species, contains_invovlement), "No shark", species)

#Function to check if "Not" is present in cell
contains_Not <- function(text) {
  return(grepl("Not", text))
}
# Replace cell with"No shark" if "Not" is present
species <- ifelse(sapply(species, contains_Not), "No shark", species)

#Function to check if "not" is present in cell
contains_not <- function(text) {
  return(grepl("not", text))
}
# Replace cell with"No shark" if "not" is present
species <- ifelse(sapply(species, contains_not), "No shark", species)

#Function to check if "Questionable" is present in cell
contains_Questionable <- function(text) {
  return(grepl("Questionable", text))
}
# Replace cell with"No shark" if "Questionable" is present
species <- ifelse(sapply(species, contains_Questionable), "No shark", species)

#Function to check if "questionable" is present in cell
contains_questionable <- function(text) {
  return(grepl("questionable", text))
}
# Replace cell with"No shark" if "questionable" is present
species <- ifelse(sapply(species, contains_questionable), "No shark", species)


#Function to check if "Salmon" is present in cell
contains_Salmon <- function(text) {
  return(grepl("Salmon", text))
}
# Replace cell with"Salmon shark" if "Salmon" is present
species <- ifelse(sapply(species, contains_Salmon), "Salmon shark", species)


#Function to check if "whaler" is present in cell
contains_whaler <- function(text) {
  return(grepl("whaler", text))
}
# Replace cell with"Whale shark" if "involvement" is present
species <- ifelse(sapply(species, contains_whaler), "Whale shark", species)


# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(seen|observed|Tooth|tooth|large|killed|captive|female|metre|foot|followed|caused|Said|young|probably|gaffed)", species, ignore.case = TRUE), "Unidentified shark", species)

# Replace the cell with "No shark" if any of the specified words are found
species <- ifelse(grepl("(hoax|No Shark)", species, ignore.case = TRUE), "No shark", species)

# Replace the cell with "Copper shark" if any of the specified words are found
species <- ifelse(grepl("(Copper)", species, ignore.case = TRUE), "Copper shark", species)

# Replace the cell with "Dogfish shark" if any of the specified words are found
species <- ifelse(grepl("(Dog|dogfish)", species, ignore.case = TRUE), "Dogfish shark", species)

# Replace the cell with "Dusky shark" if any of the specified words are found
species <- ifelse(grepl("(Dusky)", species, ignore.case = TRUE), "Dusky shark", species)

# Replace the cell with "Sevengill shark" if any of the specified words are found
species <- ifelse(grepl("(gill)", species, ignore.case = TRUE), "Sevengill shark", species)

# Replace the cell with "Angel shark" if any of the specified words are found
species <- ifelse(grepl("(Angel)", species, ignore.case = TRUE), "Angel shark", species)


attacks$Species <- species
attacks$Species <- ifelse(is.na(attacks$Species), "Unidentifies shark", attacks$Species)

#_____________________________________________________________________________________________

#ACTIVITY

#when trying to put everything in lower cap, this was the result: Errore in tolower(attacks$Activity) :  multibyte string 921 not valid --> therefore, we converted everything in ASCII (=American Standard Code for Information Interchange)

attacks$Activity <- iconv(attacks$Activity, to = "ASCII", sub = " ")
attacks$Activity <- gsub("[^ -~]", "", attacks$Activity)

attacks$Activity <- tolower(attacks$Activity)

#when running a table of all activities, we can see that we can group them in some categories: diving, race, windsurfing, walking, wading, wade fishing, touching, swimming, surfing, surf, standing, spearfishing, snorkeling,  scuba diving, playing, paddle, murder, kayaking, kayak, floating, fishing, so that we have standard categories.

attacks$Activity <- gsub("[^A-Za-z ]", "", attacks$Activity) 


kept_activities <- c("diving", "race", "windsurfing", "walking", "wading", "wade fishing", "touching", "swimming", "surfing", "surf", "standing", "spearfishing", "snorkeling", "scuba diving", "playing", "paddle", "murder", "kayaking", "kayak", "floating", "fishing")


# We create an expression pattern that matches any of the specific words
pattern <- paste(kept_activities, collapse = "|")

# Extract only the specific words (kept_activities) from the column and replace the rest with an empty string
attacks$Activity <- str_extract(attacks$Activity, paste0("\\b(?:", pattern, ")\\b"))

attacks <- attacks %>%
  mutate(Activity = str_replace_all(Activity, "\\bkayaking\\b", "kayak")) %>%
  mutate(Activity = str_replace_all(Activity, "\\bsurfing\\b", "surf"))

attacks$Activity[is.na(attacks$Activity)] <- "other" # for all activities that are not part of the list or for NA, we just put Other, since no information is available


#_____________________________________________________________________________________________

#FATAL 
 #When we run a table for the Fatality column, we see that almost observation follow the same category (Y = yes, N = No, Unknwn). An observation has a 2017, which is probably a typo (which we replace by empty case) while another one has an M. We want to believe that it was a typo as well, and that "M" was typed instead of "N".
attacks$Fatal..Y.N. <- gsub("2017", "", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <- gsub("M", "N", attacks$Fatal..Y.N.)

attacks$Fatal..Y.N. <-na_if(attacks$Fatal..Y.N., "")

#For the rest of NA, we have no information, so we took of these few observations.
attacks <- subset(attacks, !is.na(Fatal..Y.N.))
#Replace Yes by 1, 0 otherwise
attacks$Fatal..Y.N. <- ifelse(attacks$Fatal..Y.N. == "Y", 1, 0)

names(attacks)[names(attacks) == "Fatal..Y.N."] <- "Fatality"


#_____________________________________________________________________________________________
#final check up:

#We run a table for all columns, in order to check if columns have any missing value or if we still have other mistakes. 

table(attacks$Date) #this one is fine
table(attacks$Year) #this one is fine
table(attacks$Type) #here we have boating and boat which mean the same thing. Let us group them
attacks$Type <- gsub("Boating", "Boat", attacks$Type)
attacks$Type <- gsub("Boatomg", "Boat", attacks$Type)
table(attacks$Type) #this one is fine
table(attacks$Country)#this one is fine
table(attacks$Activity)#this one is fine
table(attacks$Age)#this one is fine
table(attacks$Fatality)#this one is fine
table(attacks$Species)#this one is fine


#With the following code, we will add a new column which will associate each country with its ISO code. This is a very important step for the cleaning if the enxt sections, because il will allow us to match the different datasets. 

# Get ISO country codes
iso_codes <- countrycode(attacks$Country, "country.name", "iso3c")
attacks$ISO_Code <- countrycode(attacks$Country, "country.name", "iso3c")

#Through the following function we can spot NAs. We can see that there are 23 countries with no ISO Code. We will fix them by hand. 
rows_with_na <- which(is.na(attacks$ISO_Code))

attacks$ISO_Code[attacks$Country %in% c("ENGLAND", "SCOTLAND", "BRITISH ISLES") & is.na(attacks$ISO_Code)] <- "GB"
attacks$ISO_Code[attacks$Country %in% c("AZORES") & is.na(attacks$ISO_Code)] <- "PRT"
attacks$ISO_Code[attacks$Country %in% c("ST. MAARTIN", "ST. MARTIN") & is.na(attacks$ISO_Code)] <- "MAF"
attacks$ISO_Code[attacks$Country %in% c("OKINAWA") & is.na(attacks$ISO_Code)] <- "JPN"
attacks$ISO_Code[attacks$Country %in% c("MICRONESIA") & is.na(attacks$ISO_Code)] <- "FSM"
attacks <- subset(attacks, !is.na(ISO_Code))

DT::datatable(attacks, options = list(pageLength = 10))

3.2 Temperatures Cleaning

The main mission for this dataset was to change the entire structure of the dataset. Originally, the dataset had the years as columns (one column for each year), while the rows were represented by the different countries; the temperature observations were presented by month, quarter and year. Furthermore, for each country and each month/quarter/year, the dataset shows both the change in temperatures and the standard deviation.

To proceed with the cleaning, we eliminated the rows corresponding to the standard deviation (we will not work with it), and the monthly and quarterly observations: we are only interested in yearly temperature variations.

We then rotated the dataset so as to have all the years under a single column; we subsequently matched each country and year with its temperature change observation. Finally we added the ISO code to merge with the other datasets.


temperature <- read_xlsx(here::here("data/Temperature.xlsx"))

#R did not read the "°C" on the column Unit, so we changed it manually

temperature <- temperature %>% mutate(Unit = "°C")

#We take off all the columns that we do not need for our study

temperature <- temperature %>% select(-'Area Code (M49)', -'Months Code', -'Element Code')

#We wliminate all columns having an F at the end (For each year there are 2 columns (ex: 1961 - 1961F; 1962 - 1962F). The column with the F at the end of the Year is a column where only the letter E appears for each row. We tried to look for its meaning on the FAO website but we did not succeed. Therefore, we removed it. What is more, it is important to notice that it was in any case unimportant for our work).

columns_to_remove <- grep("F$", names(temperature), value = TRUE)
temperature <- temperature[, !(names(temperature) %in% columns_to_remove)]

# We keep only meteorological year, we don't need to work with monthly and seasonal data.

target_name <- "Meteorological year"
temperature <- temperature[temperature$Months == target_name, ]

target_name2 <- "Temperature change"
temperature <- temperature[temperature$Element == target_name2, ]



#  We take off the rest of the columns that we do not need

temperature <- temperature %>% select(-'Area Code', -'Months', -'Element', -'Unit')

#The column of each year starts with an Y (Y1961, Y1961...). We take it off, so that we can easily match data with other datasets.

new_colnames <- gsub("Y", "", colnames(temperature))
colnames(temperature) <- new_colnames
temperature <- na.omit(temperature)


# We transpose the dataset so that columns become rows. With this function we have a single column, where each row corresponds to a year, but countries became the columns now. We will work on this later in the code.
temperature <- t(temperature)

# I want the first row to be the header

colnames(temperature) <- temperature[1, ]
clean_temperature <- temperature[-1, ]

#col names in CAPS
colnames(clean_temperature) <- toupper(colnames(clean_temperature))


library(tidyr)

# Convert the matrix/array "clean_temperatures" to a data frame
temperature <- as.data.frame(clean_temperature)

# Add 'Year' as a separate column
temperature$Year <- rownames(clean_temperature)

# Reshape the data from wide to long format
temperature <- tidyr::pivot_longer(temperature, 
                                     cols = -Year, 
                                     names_to = "Country", 
                                     values_to = "Temperature")

# Reorder columns as per your desired format
temperature <- temperature[c("Year", "Country", "Temperature")]

#As we said at the beginning, we only work with data from 1970
temperature <- temperature %>% filter(Year >= 1970)
temperature$Year <- as.numeric(temperature$Year)

temperature <- temperature %>% 
  mutate(Country = ifelse(Country == "UNITED STATES OF AMERICA", "USA", Country))

#As for the attacks dataset, we add a column for the ISO code.

temperature$ISO_Code <- countrycode(temperature$Country, "country.name", "iso3c")
temperature$ISO_Code[temperature$Country %in% c("MICRONESIA") & is.na(temperature$ISO_Code)] <- "FSM"

#Now that we put ISO code, we can remove the country column
temperature <- temperature %>% select(-'Country')

DT::datatable(temperature, options = list(pageLength = 10))

3.3 Sea Level Cleaning

We choose a data set about the sea level change observed during the last years. We start analyzing the sea level changes since 1993 because we did not find any data of years before. In this data set we had few columns but we only kept 2 columns that were the more relevant for us: the year and the GMSL_GIA. This last column is the global mean of sea level including the glacial isostatic adjustement (GIA). We choose the column with GIA and not without because this show the response of solid Earth to the past changes in surface loading by ice and water such as the glaciation and deglaciation.


sealevel <- read.csv(here::here("data/sealevel.csv"))

# keep the 2 columns we are interested to analyze because the other are irrelevant for our project
sealevel <- subset(sealevel, select = c(Year, GMSL_GIA))

#Show the year only the first time by creating a new column called Year2
sealevel$Year2 <- ifelse(duplicated(sealevel$Year), NA, sealevel$Year)

# delete column Year due to the creation of column Year 2
sealevel <- subset(sealevel, select = -Year)

# delete NA because it does not bring anything to our analysis
sealevel <- na.omit(sealevel)

#Change name of column Year
column <- gsub("2","",colnames(sealevel))
colnames(sealevel) <- column

DT::datatable(sealevel, options = list(pageLength = 10))

3.4 Greenhouse Gas Emissions Cleaning

Finally, the GHG emissions dataset was the easiest to clean. Indeed, we already have a column for the ISO code, so we simply had to filter the years so that the observations began in 1970.

co2 <- read.csv(here::here("data/CO2.csv"))

# Change names of columns in order to have the same columns that in the other datasets
names(co2)[names(co2) == "Entity"] <- "Country"
names(co2)[names(co2) == "Annual.CO..emissions"] <- "Annual CO2 Emissions"
names(co2)[names(co2) == "Code"] <- "ISO_Code"



# keep the 3 columns we are interested to analyze
co2 <- subset(co2, select = c("ISO_Code", "Year", "Annual CO2 Emissions"))


# only keep information starting in 1970 because we want to look at the evolution of 
# the last 50 years
co2<- co2[co2$Year >= 1970, ]

co2$ISO_Code <- na_if(co2$ISO_Code, "")

co2 <- subset(co2, !is.na(ISO_Code))

DT::datatable(co2, options = list(pageLength = 10))

3.5 Merged Data Set

After cleaning the four datasets, we have finally merged them. This step is essential to be able to run regression across variables that came from multiple datasets.



#MERGE FIRST TWO DATA SETS

# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data <- left_join(attacks, temperature, by = c("Year", "ISO_Code"))

#MERGE WITH THIRD

# Assuming 'Year' is the columns in both datasets for matching
merged_data2 <- left_join(merged_data, sealevel, by = c("Year"))

#MERGE WITH FOURTH

# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
merged_data3$Temperature <- as.numeric(as.character(merged_data3$Temperature))
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))

#Change name of a column
colnames(merged_data3)[colnames(merged_data3) == 'Country.x'] <- 'Country'
merged_data3$Temperature <- as.numeric(merged_data3$Temperature)


#for merged_data3, which we will use for our regressions, we need numerical variables
#therefore, we will make changes in some categorical ones

#TYPE

#Let's transform the Type variable on numeric. 1 corresponds to Boat, 2 to Unprovoked, Invalid to 3
# Provoked to 4, Questionable to 5 and Sea Disaster to 6
categories <- c("Boat" = 1, "Unprovoked" = 2, "Invalid" = 3, "Provoked" = 4, "Questionable" = 5, "Sea Disaster" = 6)
merged_data3$Type <- factor(merged_data3$Type, levels = names(categories))
merged_data3$Type <- as.numeric(merged_data3$Type)

#TIME

# Let's focus on the transformation of time where morning correspond to 0, afternoon to 1,
# evening to 2 and night to 3
merged_data3$Time <- as.numeric(factor(merged_data3$Time, levels = c("morning", "afternoon", "evening", "night")))


#SEX

merged_data3$Sex <- ifelse(merged_data3$Sex == "M", 0, 
                      ifelse(merged_data3$Sex == "F", 1, 2))


#SPECIES


merged_data3 <- merged_data3 %>%
  mutate(Species = as.numeric(factor(Species)))

# Get the top 5 species
top_species <- merged_data3 %>%
  count(Species, sort = TRUE) %>%
  slice_head(n = 5)

# Create a logical condition to include only the top 5 species
condition <- merged_data3$Species %in% top_species$Species

# Subset the data based on the condition
species_filtered_data <- merged_data3[condition, ]



DT::datatable(merged_data3, options = list(pageLength = 10))

4. EDA

4.1 Evolution of shark attacks through the years

With the y-axis showing the frequency of shark attacks and the x-axis showing the passage of time, the graph below displays a striking pattern in shark attacks over time. The data exhibits a recognizable pattern that indicates a steady increase in the quantity of shark attacks within the designated period. This rising tendency begs critical concerns regarding what is causing such an increase and necessitates more research into ecological, environmental, and human-related elements that could be involved in this occurrence. The graph emphasizes the need of keeping an eye on and comprehending the dynamics of these episodes in addition to providing a visual depiction of the rise in shark attacks.

This trend is a key research topic: why are marine attacks increasing year after year? What could be the reasons? In fact, the reasons can be various, for example of an ecological, environmental nature or linked to human seaside activity. The graph highlights the need to keep an eye on and understand the dynamics of these episodes, which is exactly what we will do during our study.

Three fundamental areas of this graph should be an element of analysis: 1975-1978; 1988-1990; 2019-2020. In fact, we could ask ourselves, why is there a decrease in attacks in this period? For the last period, the answer is simple: our dataset presents data up to June 2018, for the second half of the year we have no data, which explains the drastic drop in attacks. For the other two periods, we will discover the reasons during a more in-depth analysis

plot1 <- ggplot(data = attacks, aes(x = Year)) +
  geom_smooth(stat = "count", aes(group = 1), color = "blue", size = 1) +
  geom_bar(stat = "count", fill = "red") +
  labs(title = "Shark Attacks Evolution Over Years", x = "Year", y = "Number of Attacks")

plot1

4.2 Pattern of shark attacks frequencies

Pattern of shark attacks frequencies through the day

The bar graph below, where the x-axis indicates four time categories —morning, afternoon, evening, and night— displays the distribution of shark attacks throughout the day. The graph’s most noticeable aspect is how frequently shark attacks occur in the afternoon, followed by the morning, evening and night. This concentration of incidents in the afternoon raises intriguing questions about potential human behavioral factors that could contribute to this temporal pattern.

First and foremost, we believe that the afternoon is when most people choose to go swimming, which contributes to the frequency of shark attacks during this time. Recreational water activities such as swimming, surfing, kayaking, and other water sports are probably more popular during these times. Furthermore, the afternoon is when the seas are the warmest, so it’s intriguing to find out if this aspect also has a significant influence on shark activity. Conversely, the decreased number of assaults that occur at night may be related to less people using the water during that time, but there may be other possible reasons as well (less high temperatures, poorer visibility).

Lastly, we suppose that the medium frequencies of the morning and evening might correspond to transitional periods when shark and human behavior are changing.


bar_colors <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3")

# Create a sorted table
sorted_table <- table(attacks$Time)
sorted_table <- sorted_table[order(-sorted_table)]

# Create a bar plot with the sorted data
barplot1 <- barplot(sorted_table,
                    col = bar_colors,
                    xlab = "Time of Day", ylab = "Frequency of Attacks",
                    main = "Frequency of Shark Attacks by Time of Day",
                    border = "white",
                    ylim = c(0, 1800),
                    space = 0.5,
                    cex.names = 0.8,
                    font.axis = 2,
                    beside = TRUE)


# Add text labels with frequencies on the bars
text_labels <- sorted_table
text(x = barplot1, y = sorted_table, labels = text_labels, pos = 3, cex = 0.8, col = "black")

Pattern of shark attacks frequencies per country

A first glance at our plotted data reveals that the United States, Australia, and South Africa are the three primary countries with a significant number of shark attacks.

shark_attacks_by_country <- attacks %>%
  group_by(Country) %>%
  summarise(total_attacks = n())

shark_attacks_by_country <- shark_attacks_by_country %>%
  mutate(CountryCategory = ifelse(total_attacks >= 30, as.character(Country), "Other"))

# Order the countries by frequency in descending order
shark_attacks_by_country$CountryCategory <- reorder(shark_attacks_by_country$CountryCategory, -shark_attacks_by_country$total_attacks)

# Plot the data
plot2<- ggplot(data = shark_attacks_by_country, aes(x = total_attacks, y = CountryCategory)) +
  geom_col(fill = "skyblue") +
  labs(title = "Total Shark Attacks in Each Country", x = "Number of Attacks", y = "Country") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1)) +
  scale_y_discrete(labels = scales::label_wrap(10))

interactive_plot <- ggplotly(plot2)
interactive_plot

4.3 Fatality of shark attacks

Are shark attacks more likely to lead to death?

IT WAS A PIECHART…

Fatalities by type of shark

As we can see in this graph, most of attacks were not fatal, meaning that sharks injured a lot of people without taking their lives away. The types of sharks that did the most of accidents were the following ones: Bull shark, Tiger shark and White shark, with the deadlies being the last one. Why do they commit attacks? Probably because their feeding behavior. Maybe they misidentify their predators attacking people instead of animals like turtles, seals, or sea lions. This is consistent with Bull shark and Tiger shark being repetitevly named as the deadlies sharks in the seas and oeceans (CBS News, 2010). Another relevant information derived from this graph is the big number of unidentified sharks. Out of all the attacks recorded by our dataset, we have almost 600 Unidentified species. It is such a pity not to be able to have this information, which would have led us to more insightful results.